1. Introduction
1.1 Data loading
1.2 Data scaling
2. Broad analysis
2.1 Boxplots
2.2 Densityplot
2.3 K-means clustering
2.4 PCA
3. Specific analysis: Lapatinib
3.1 Analysis of biomarker and t-test between treated and untreated cells
3.1.1 Expression of biomakers before and after the treatment of Lapatinib
3.1.2 Searching for our own biomarkers
3.1.3 Boxplot of our biomarkers
3.1.4 ggBoxplot of our biomarkers’ basalexpression in breast cells with Kruskal-Wallis test
3.1.5 Heatmap of biomarkers in breast cells
3.2. Paired t-test between the Lapatinib treated und untreated data
3.2.1 Genome wide paired t-test
3.2.2 Paired t-test of our biomarker before and after treatment
3.2.3 Paired t-test over genes, increased confidence level (99 %)
3.2.4 Visualization of the genome-wide paired t-test in a vulcano plot
3.3 K-means
3.4 PCA
3.5 Spearman correlation
4. Main questions
4.1 Question 1: Does the cell doubling time correlate with reduced drug sensitivity?
4.1.1 Linear regression ~ Inoculation_Density
4.1.2 Linear regression ~ Doubling-time
4.1.3 Multiple regression
4.1.4 Table of conclusion
4.2 Question 2: Does Lapatinib have a similar effect on lung cancer as Erlotinib?
4.2.1 Heatmap
4.2.2 Plot Erlotinib and Lapatinib, colored by tissue
4.2.3 Pearson correlation
4.2.4 Lung cellines
4.2.5 Anova
4.3 Question 3: Does Lapatinib have a similar effect on brain metastasis as it does on breast cancer?
4.3.1 Volcano plot
4.3.2 Heatmap for visualization
4.3.3 Investigation of correlation
Our project is about Lapatinib and divided into three parts: Broad analysis, specific analysis and our three main questions. The main questions are the most important part of our project. They address whether cell doubling time correlates with reduced drug sensitivity and the effect of lapatinib on lung cancer and brain metastasis. Twenty-five per cent of breast cancers overexpress ErbB2 / HER which leads to a more aggressive phenotype. Lapatinib blocks tyrosine kinases that belong to the HER2 receptor, which is increasingly found on the cell surface of cancer cells. HER2 is a receptor for the epidermal growth factor (EGF), which stimulates the cell division of cancer cells. By blocking these HER2 receptors, the unregulated growth of cancer cells can be controlled again. Lapatinib binds the ATP-binding site of the receptor’s intracellular domain and inhibits the survival and proliferation pathways of the tumor cell.
After checking for normalization, we scaled our data in the first place to provide the scaled data for further analysis.
As the very first step in our data analysis, we checked the normalization of the data.
In order to check whether there is a connection between the various drugs, we have colored the plot by drug.
As shown, there is a connection between the different “boxes” and the different drugs. This could be the reason for different batches so we scale the data.
As you can see the scaling is necessary to compensate the batch effect. Scaling is then also applied on Untreated before it is further analyzed.
As an other method to get in touch with the data, we checked if there is a difference between the treated and untreated genexpression. The abline shows the 3 quantiles ( 25 % 50 % 75 % )
As assumed, the expression patterns don’t differentiate much.
To look for clusters in the raw data we performed a k-means clustering and searched for potentially clusters using the most variable (>75 % Var), and thus the most informative genes.
Running a loop for the best cluster-amount (searching for an “elbow”)
As we wanted an “elbow”, it is hard to determine the amount of clusters.
Since the clustering method by performing k-means clustering was not successful, we performed a principal component analysis for dimensionality reduction as another method for identifying clusters or patterns while preserving as much information of the original data set as possible.
By plotting a scree plot of the variation percentage, we can display how much variation a principal component captures from the original data.
As seen in the scree plot, the first three or four PCs have capture most of the information since the curve bends at an “elbow” at PC4 - this is our cutting off point.
In order to color our clusters according to sample annotations, we created a color vector by dissecting the metadata matrix by our PCA data samples. Additionally we were fixing some labeling problems.
Subsequently we created the color vectors for coloring according to drug and tissue type.
To determine clusters of samples based on their similarity, we plotted PC1 and PC2 as well as a PC2 and PC3 while coloring by tissue type and drug respectively.
As we can see, in contrast to the k-means method, clusters can be determined. Respecting the color annotation, it is clear to say that the samples cluster according to drug treatment.
Unfortunately, we can not use them as biomarkers so we search for our own biomarkers.
## GDF15 NUPR1 DDIT3 TRIB3 ATF3 ASNS
## 0.7150148 0.6843715 0.5794749 0.5753359 0.5603209 0.5305063
GDF3 Growth Differentiation Factor 3 belongs to the transforming growth factor beta (TGF-beta) superfamily and has some intrinsic activity.
NUPR1 stands for nuclear protein, transcriptional regulator, 1. It regulates the transcription during stress signals by empowering cells with resistance to stress.
DDIT3 encodes for the DNA damage-inducible transcript 3 protein, a member of the CCAAT/enhancer-binding protein which is a transcription factor. It is important for the response of cell stresses. The DNA damage-inducible transcript 3 protein induces cell cycle arrest and apoptosis in response to ER stress.
TRIB3 stands for Tribbles Pseudokinase 3. The kinase is induced by the transcription factor NF-kappaB. It is essential for cell proliferation.
ATF3 stands for activating transcription factor 3. It belongs to the mammalian activation transcription factor/cAMP responsive element-binding (CREB). ATF-3 is induced by physiological stress.
ASNS stands for Asparagine Synthetase: The enzyme catalyzes the production of the amino acid L-asparagine.
##
## Paired t-test
##
## data: before and after
## t = -1.5497e-13, df = 718140, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0003680021 0.0003680021
## sample estimates:
## mean of the differences
## -2.909656e-17
## Marker pvalue
## 1: GDF15 2.612447e-12
## 2: NUPR1 1.327187e-11
## 3: DDIT3 9.979311e-20
## 4: TRIB3 6.995573e-20
## 5: ATF3 2.142510e-12
## 6: ASNS 2.207251e-14
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000 0.0002317 0.0756336 0.2421297 0.4428285 0.9998899
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0007514 0.0090894 0.0138573 0.0197222 0.0226351 0.4897715
## [1] 3325 54
## [1] 6627.055
## [1] 6159.495
Since the clustering method by performing k-means clustering again was not successful, we performed a principal component analysis over the selected Lapatinib samples for identifying clusters or patterns while preserving as much information of the original data set as possible.
By plotting a scree plot of the variation percentage, we can display how much variation a principal component captures from the original data.
As seen in the scree plot, the first four PCs have capture most of the information since the curve bends at an “elbow” at PC4 - this is our cutting off point.
In order to color our clusters according to sample annotations, we created a color vector by dissecting the metadata matrix as well as the cell-line-annotation matrix by our PCA data samples.
Subsequently we created the color vectors for coloring according to drug and microsatellite instability status.
To determine clusters of samples based on their similarity, we plotted PC1 and PC2 as well as a PC2 and PC3 while coloring by microsatellite instability status and drug respectively.
As we can see, rarely any clusters can be determined. In Addition, the color annotation does not reveal any pattern of clustering.
Modeling of a linear relationship between cell doubling time and inoculation density.
# Spearman correlation
cor(Cellline_Annotation$Doubling_Time, Cellline_Annotation$Inoculation_Density, method = "spearman")
## [1] 0.5674381
##
## Call:
## lm(formula = Cellline_Annotation$Inoculation_Density ~ Cellline_Annotation$Doubling_Time)
##
## Coefficients:
## (Intercept) Cellline_Annotation$Doubling_Time
## 9110.9 169.4
As you can see from the plot, there is a positive correlation between inoculation density and doubling time. This means that the higher the inoculation density, the higher the doubling time, which suggests that the tumour grows slowly because the cells need more time to divide.
As mentioned in our presentation, we want to create a model to predict GI50-values and thus to predict, if Lapatinib is a good choice. Therefore, we took several datasets and performed three regression analyses.
The first linear model tries to predict the GI-50 value among the data of the Inoculation density.
The second linear model tries to predict the G-50 value among the data of the Doubling-time.
Finally, we did a multiple regression with both datasets to predict GI50-values.
The following table shows important values.
| Inoculation_Density | Doublingtime | Multiple regression | |
|---|---|---|---|
| R-squared-value | 0.0013505 | 0.0913649 | 0.0044753 |
| F-statistic | 0.0622076 | 4.6253841 | 0.0921552 |
| RMSE of test-dataset | 0.5192093 | 0.6574516 | 0.7987091 |
The data is hard to discuss, because we used random values for our test and training dataset. Hence, different values on every run are obtained. We could use “selected values” (First 20) but this would be kind of cheating.
As we read in the paper “Antitumor and antiangiogenic effect of the dual EGFR and HER-2 tyrosine kinase inhibitor lapatinib in a lung cancer model. (Diaz et al., 2010)”, Lapatinib and Erlotinib are said to have similar effects. To verify this, we first looked at the correlation of the drugs and then created a graph showing the difference between GI50 for a certain cell line and the middle GI50 for both drugs.
Erlotinib and Lapatinib have a strong correlation.
The difference from the GI50 for a particular cell line and the mean GI50 is plotted here for Erlotinib and Lapatinib.
## [1] 0.6528188
A Pearson correlation coefficient of ~ 0.65 confirms that these patterns are similar. One reason for this is that both Erlotinib and Lapatinib are EGFR inhibitors. Cell lines that were more sensitive are displayed as bars that project to the right of the mean. Cell lines that were less sensitive are displayed with bars projected to the left.
To answer our more specific question, whether Lapatinib also has an effect on lung cancer, we will only look at the cell lines in lung cancer.
The difference from the GI50 for cell line from lung tissue and the mean GI50 is plotted here for Erlotinib and Lapatinib.
## [1] 0.9609488
A pearson correlation coefficent of ~ 0.96 suggests that Lapatinib has a similar effect on lung cancer as Erlotinib.
##
## Fligner-Killeen test of homogeneity of variances
##
## data: log2FoldChange by Drug
## Fligner-Killeen:med chi-squared = 0.75387, df = 1, p-value =
## 0.3853
As a proven fact, Lapatinb crosses the brain-blood barrier. Since breast cancer tends to spread and form metastasis in brain cells and Lapatinib is used in anti-breast cancer therapy, we wanted to analyse our data for similar effects after Lapatinib treatment in breast and cns tissue cells.
At first, we selected Lapatinib response genes by creating matrices with cns- and breast-cell line fold change values.
Next, we performed a paired t-test of treated and untreated cns and breast samples, respectively, to determine statistical significant fold change values. Since we are performing multiple testing over different cell lines from breast and cns tissues, we want to adjust our p value by performing a Benjamini Hochberg correction.
For visualization of the biological significance (fold change values) and the statistical significance (adjusted p-values), we plotted our data in an interactive volcano plot.
To visualize the expression patterns of the top peak genes present in both tissue types we calculated a heatmap, comparing the expression profiles of the two tissue types.
For better visualization of the expression patterns of the top peak genes among breast and cns tissues we computed a heatmap of the mean fold changes for direct comparison.
The heatmap shows, that the top expression genes are regulated similarly in breast and in cns cells after treatment with Lapatinib.
To gather a more detailed conception of the gene expression regulation values (aka the fold change) we also computed a comparing bar plot showing the exact fold change values.
At last, we investigated the expression regulation in both tissue types by calculating the pearson correlation between the fold change values of the top peak genes.
## [1] 0.921812
The obtained correlation value of ~ 0.92 indicates that Lapatinib treatment yields a similar effect on gene expression in breast and cns tissue cells and therefore might be possible to treat cancer in brain metastasis.